
Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support#1695

Open
SlightwindSec wants to merge 12 commits into bitsandbytes-foundation:main from SlightwindSec:upstream_main_npu_enabled

Conversation


@SlightwindSec SlightwindSec commented Jul 2, 2025

What does this PR do?

This PR ports Ascend NPU backend changes from the multi-backend-refactor branch and integrates with custom ops. It includes changes to enable Ascend build and translation of kernels and ops to Ascend-compatible operators. As the AscendC-based high-performance NF4 implementation is still in progress, a temporary PyTorch version is used for now. The build steps remain the same as before from the user's standpoint.

Collaborators

@ji-huazhong @Ginray @Runningwater23 @jiaqiw09
cc @Titus-von-Koeller @matthewdouglas @amathews-amd @sunway513

@SlightwindSec SlightwindSec changed the title Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support [WIP] Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support Jul 4, 2025
@matthewdouglas matthewdouglas self-requested a review July 8, 2025 16:41
@matthewdouglas matthewdouglas self-assigned this Jul 8, 2025
@matthewdouglas matthewdouglas added the Ascend NPU Related to Ascend NPU backend label Jul 8, 2025

unlizi commented Aug 7, 2025

Error Summary:

Encountered a vector core execution failure on an Ascend 910B3 NPU while running Qwen-Image NF4 quantized model inference. The NPU reported multiple DDR memory access violations (error code 0x800000) across 12 compute cores, specifically during dequantize_blockwise_fp32_nf4_1 kernel execution. The system then threw an ACL synchronization error (code 507035) when attempting a tensor device transfer (pos_freqs.to(device)).

Technical Breakdown:

Hardware level: multiple cores (5-15, 20-22) triggered MTE (Memory Transfer Engine) faults indicating invalid DDR address access.
Software stack:
Framework: PyTorch 2.6.1 + torch-npu 2.6.1
CANN: 8.1.RC1
Failing operation: NF4 dequantization kernel
Error progression:
Initial memory range violation → vector core exceptions → stream synchronization failure.
The error suggests a possible memory alignment issue or an NPU microarchitecture incompatibility with the quantization pattern.

(MindSpore) [ma-user qwen-image]$python nf4_inference.py
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch_npu/utils/collect_env.py:58: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/latest owner does not match the current owner.
warnings.warn(f"Warning: The {path} owner does not match the current owner.")
/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch_npu/utils/collect_env.py:58: UserWarning: Warning: The /usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux/ascend_toolkit_install.info owner does not match the current owner.
warnings.warn(f"Warning: The {path} owner does not match the current owner.")
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
Loading pipeline components...: 0%| | 0/5 [00:00<?, ?it/s]/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch_npu/utils/storage.py:38: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
if self.device.type != 'cpu':
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.30s/it]
Loading pipeline components...: 40%|████████████████████████████████████████▍ | 2/5 [00:20<00:25, 8.42s/it]The config attributes {'pooled_projection_dim': 768} were passed to QwenImageTransformer2DModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:40<00:00, 13.49s/it]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:03<00:00, 12.79s/it]
0%| | 0/50 [00:00<?, ?it/s][W807 18:53:28.145560476 compiler_depend.ts:57] Warning: EZ9999: Inner Error!
EZ9999: [PID: 624026] 2025-08-07-18:53:28.091.327 The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 5, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003ff6, mte error info: 0x200600006b, ifu error info: 0x2000111ee1ac0, ccu error info: 0xe150ca367a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
TraceBack (most recent call last):
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:26, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 10, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x6a00000075, mte error info: 0x200600006b, ifu error info: 0x2000111e36a00, ccu error info: 0x7836503b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:31, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 11, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111efbc00, ccu error info: 0x7836503b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:32, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 12, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x7300000064, mte error info: 0x200600006b, ifu error info: 0x2000111e75500, ccu error info: 0xc27201af7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:33, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 15, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x680000016f, mte error info: 0x200600006b, ifu error info: 0x2000111ee30c0, ccu error info: 0x8a6d50d07a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:36, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 20, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x680000016f, mte error info: 0x200600006b, ifu error info: 0x2000111edab40, ccu error info: 0x7629795c7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:37, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 21, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111ed4ac0, ccu error info: 0x847cc0897a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:38, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 22, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x730000007e, mte error info: 0x200600006b, ifu error info: 0x2000111d982c0, ccu error info: 0x7836503b1a8000de, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:0, tslot:4, thread:0, ctxid:0, blk:39, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 6, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111e40d40, ccu error info: 0x127070cc2b0000b1, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:27, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 7, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003ff6, mte error info: 0x200600006b, ifu error info: 0x2000111eb3600, ccu error info: 0xd7a244127a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:28, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 8, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x6a00000075, mte error info: 0x200600006b, ifu error info: 0x2000111eb8e00, ccu error info: 0x6cf8f06c7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:29, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 9, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x6a00000075, mte error info: 0x200600006b, ifu error info: 0x2000111eaef80, ccu error info: 0x7629795c7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:30, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 13, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003ff6, mte error info: 0x200600006b, ifu error info: 0x2000111eabb80, ccu error info: 0x6c13551b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:34, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
The error from device(chipId:6, dieId:0), serial number is 232, there is an aivec error exception, core id is 14, error code = 0x800000, dump info: pc start: 0x124837587968, current: 0x124837588284, vec error info: 0x5c00003fc9, mte error info: 0x200600006b, ifu error info: 0x2000111ed8900, ccu error info: 0x244388b7a800075, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x124100433000.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1434]
The extend info: errcode:(0x800000, 0, 0) errorStr: The DDR address of the MTE instruction is out of range. fixp_error0 info: 0x600006b, fixp_error1 info: 0x20, fsmId:1, tslot:4, thread:0, ctxid:0, blk:35, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1446]
Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1366]
AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1119]
Aicore kernel execute failed, device_id=0, stream_id=2, report_stream_id=2, task_id=6348, flip_num=0, fault kernel_name=dequantize_blockwise_fp32_nf4_1, fault kernel info ext=none, program id=0, hash=12067931037022988496.[FUNC:GetError][FILE:stream.cc][LINE:1119]
[AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1119]
rtStreamSynchronize execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
(function copy_between_host_and_device_opapi)
0%| | 0/50 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ma-user/work/qwen-image/nf4_inference.py", line 45, in <module>
image = pipe(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/diffusers/pipelines/qwenimage/pipeline_qwenimage.py", line 655, in __call__
noise_pred = self.transformer(
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/diffusers/models/transformers/transformer_qwenimage.py", line 594, in forward
image_rotary_emb = self.pos_embed(img_shapes, txt_seq_lens, device=hidden_states.device)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.10/site-packages/diffusers/models/transformers/transformer_qwenimage.py", line 202, in forward
self.pos_freqs = self.pos_freqs.to(device)
RuntimeError: ACL stream synchronize failed, error code:507035
[W807 18:53:28.156241503 compiler_depend.ts:526] Warning: NPU warning, error code is 507035[Error]:
[Error]: The vector core execution is abnormal.
Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronizeWithTimeout execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 624026] 2025-08-07-18:53:28.111.927 wait for compute device to finish failed, runtime result = 507035.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeUsedDevices)
[W807 18:53:28.157702343 compiler_depend.ts:508] Warning: NPU warning, error code is 507035[Error]:
[Error]: The vector core execution is abnormal.
Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronizeWithTimeout execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 624026] 2025-08-07-18:53:28.113.595 wait for compute device to finish failed, runtime result = 507035.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function npuSynchronizeDevice)
[W807 18:53:28.158988663 compiler_depend.ts:151] Warning: NPU warning, error code is 507035[Error]:
[Error]: The vector core execution is abnormal.
Rectify the fault based on the error information in the ascend log.
EH9999: Inner Error!
rtDeviceSynchronizeWithTimeout execute failed, reason=[vector core exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
EH9999: [PID: 624026] 2025-08-07-18:53:28.114.891 wait for compute device to finish failed, runtime result = 507035.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
TraceBack (most recent call last):
(function empty_cache)
[ERROR] 2025-08-07-18:53:33 (PID:624026, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception

@SlightwindSec
Author

Hi @unlizi, thanks for the report.

I've pushed a fix for the issue you reported. Big thanks to @jiaqiw09 for helping debug and solve this.

Could you please try re-running the validation?


jiaqiw09 commented Nov 7, 2025

@SlightwindSec Thanks for your fix. I ran several test cases, and they all passed without issues.

Environment:

  • Ascend NPU 2025RC3 HDK driver and CANN
  • Python 3.10
  • Torch / Torch-NPU 2.5.1

Inference

I tested both LLaMA3-8B and Qwen3-4B models.
Inference works correctly.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from bitsandbytes.functional import dequantize_4bit
import torch

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

model_id = "./llama3_8b"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="npu:0",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    quantization_config=quantization_config,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training

For training, I used the LLaMA-Factory framework to perform QLoRA fine-tuning with DDP and DeepSpeed ZeRO-3.
Everything worked as expected.

You can reproduce the setup using the script below:
examples/train_qlora/llama3_lora_sft_bnb_npu.yaml
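For readers without that file at hand, a config along these lines exercises the same path. The key names follow LLaMA-Factory's published example configs, but every value below is an illustrative placeholder, not the contents of the referenced file:

```yaml
# Hypothetical sketch of a LLaMA-Factory QLoRA SFT config; values are placeholders.
model_name_or_path: ./llama3_8b
stage: sft
do_train: true
finetuning_type: lora
quantization_bit: 4
quantization_method: bitsandbytes
dataset: identity
template: llama3
output_dir: saves/llama3-8b/qlora-sft
per_device_train_batch_size: 1
deepspeed: examples/deepspeed/ds_z3_config.json
```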

@SlightwindSec SlightwindSec changed the title [WIP] Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support Nov 10, 2025
@jiaqiw09

@SlightwindSec @matthewdouglas I have updated the installation docs and the README. Would you mind taking a look?


jiaqiw09 commented Nov 14, 2025

@matthewdouglas Hi, the conflict in the CMakeLists.txt file has been resolved. Would you mind reviewing the changes and sharing any feedback or suggestions? As mentioned in #1545, the multi-backend branch was deprecated, so we are planning to merge Ascend NPU 4-bit support into the main branch.

Collaborator

@TimDettmers TimDettmers left a comment


PR Review: #1695 — Enable Ascend NPU Backend with Custom Ops Integration for NF4 Support

Classification: [feature] [platform] [build] — New NPU backend with native AscendC kernels, build system changes, and device dispatch integration.

Size: 562 additions, 5 deletions across 12 files. Medium-large, new backend PR with new files.

This PR adds an Ascend NPU backend supporting NF4 4-bit quantization and dequantization. It includes a Python-side quantize_4bit implementation, native AscendC dequantization kernels (npu_kernels.cpp), CMake build integration, and cextension.py/__init__.py wiring. The scope is focused on NF4 only (no FP4, no 8-bit ops, no optimizers).

Blocking issues (4):

  1. cextension.py: NPU detection takes priority over CUDA — In the BNB_BACKEND detection chain, is_npu_available() is inserted before torch.cuda.is_available(). Since is_npu_available() only checks whether the torch_npu package is installed (via importlib.util.find_spec), NOT whether an actual NPU device is present, any system that has torch_npu installed alongside CUDA will incorrectly set BNB_BACKEND = "NPU" instead of "CUDA". The XPU backend checks torch._C._has_xpu (actual device presence); the HPU backend checks torch.hpu.is_available(). The NPU backend needs an equivalent device-availability check, not just a package-installed check. Similarly, in get_native_library(), the NPU path would be selected over XPU when both packages are installed.

  2. cextension.py: Broken if/elif chain in get_native_library() — The original code has two separate if blocks: if cuda_specs: ... binary_path = cuda_binary_path followed by a separate if torch._C._has_xpu: .... The PR converts the second block's preceding blank line into an elif is_npu_available(): clause, making the XPU check a separate if that runs unconditionally. The result: on XPU systems with torch_npu installed, the NPU path would be set, then immediately overwritten by the XPU path. But on NPU systems without XPU, the flow is correct only by coincidence. The library loading logic needs proper if/elif/elif chaining across all backends.

  3. npu_ops.cpp: Memory leak — tilingHost is allocated with malloc() but never freed. Every call to dequantizeBlockwiseNf4() leaks sizeof(BlockwiseNf4TilingData) bytes. This is called on every dequantization operation (every Linear4bit forward pass), so the leak accumulates during inference and training.

  4. npu_ops.cpp: Hardcoded core count — coreNum is hardcoded to 40 in both the tiling function and the kernel launch blockDim. This is specific to one Ascend chip variant (e.g., 910B). Different Ascend models have different core counts. This should either be queried at runtime or made configurable.
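A minimal sketch of the chain shape that issues 1 and 2 call for. The boolean flags below are hypothetical stand-ins for the real checks (torch.version.hip, torch.cuda.is_available(), a true NPU device-presence test rather than a package-installed test, torch._C._has_xpu); only the priority ordering and the single if/elif chain are the point:

```python
def select_backend(has_hip=False, has_cuda=False,
                   npu_device_present=False, xpu_device_present=False):
    """Pick one backend; stand-in flags model the real runtime checks."""
    if has_hip:
        return "ROCm"
    elif has_cuda:
        return "CUDA"          # CUDA outranks NPU when both stacks are installed
    elif npu_device_present:   # device present, not merely torch_npu importable
        return "NPU"
    elif xpu_device_present:   # stays in the same chain, never a separate `if`
        return "XPU"
    return "CPU"

print(select_backend(has_cuda=True, npu_device_present=True))  # CUDA
```

With a single chain, a CUDA box that happens to have torch_npu installed can never be reclassified as NPU, and the XPU branch can no longer overwrite an earlier NPU selection.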

Additional issues (non-blocking but significant):

  1. __init__.py: Redundant and insufficient availability check — The import guard if importlib.util.find_spec("torch") and importlib.util.find_spec("torch_npu") checks for package installation, not device availability. The find_spec("torch") check is redundant since torch is already imported at the top of the file. Other backends use device-availability checks (e.g., torch.xpu.is_available(), torch.hpu.is_available()). The NPU backend should check for actual NPU device availability, e.g., if importlib.util.find_spec("torch_npu"): import torch_npu; if torch.npu.is_available(): ... (following the HPU pattern).

  2. quantize_4bit is pure PyTorch, not a native kernel — The NPU quantize_4bit implementation is a pure Python/PyTorch implementation, while dequantize_4bit uses native AscendC kernels. This asymmetry means quantization may be slow. The PR description acknowledges this ("temporary PyTorch version"), but it should be documented with a TODO in code.

  3. quantize_4bit mutates global state — global _NF4_QUANT_TABLE; _NF4_QUANT_TABLE = _NF4_QUANT_TABLE.to(A.device) modifies the shared module-level _NF4_QUANT_TABLE from backends/utils.py. This is not thread-safe and would break if multiple devices are used simultaneously. Other backends using this table should be checked for the same pattern.

  4. NF4 packing order differs from CUDA — The comment in quantize_4bit states "Pack 4-bit values in NPU-compatible order (low nibble first) to match NPU-specific unpacking logic; differs from CUDA's packing." This means NPU-quantized checkpoints are NOT compatible with CUDA-dequantization and vice versa. This is a significant interoperability concern: users who quantize on NPU cannot dequantize on CUDA. This should be prominently documented.

  5. No tests included — The PR adds no test files. For a new backend, there should be at minimum integration tests for quantize_4bit and dequantize_4bit round-trip correctness, and parametrized device tests.

  6. npu_kernels.cpp: Missing newline at end of file.

  7. cextension.py: import importlib placement — The import is added after the from typing import Optional line but before the import torch line, violating the standard library grouping convention (it should be grouped with the other stdlib imports at the top, not after typing).

  8. CMake: npu-smi parsing is fragile — The CMake auto-detection runs npu-smi info | awk ... 'NR==7 {print $3}' which relies on exact output format of npu-smi. Different CANN versions or locales may change the output format. The error message also has a typo: "Auto-detech" should be "Auto-detect".
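As a sketch of the round-trip correctness check that item 5 asks for: the codebook below is the standard 16-entry NF4 table, but the block layout and bit-packing are simplified relative to bitsandbytes, so this is a shape of the test rather than the real one:

```python
# Standard NF4 codebook (normalized quantiles of a Gaussian).
NF4 = [-1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
       -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
       0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
       0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
       0.7229568362236023, 1.0]

def quantize_block(xs):
    """Blockwise absmax scaling, then nearest-codebook index per element."""
    absmax = max(abs(x) for x in xs) or 1.0
    idxs = [min(range(16), key=lambda i: abs(NF4[i] - x / absmax)) for x in xs]
    return idxs, absmax

def dequantize_block(idxs, absmax):
    return [NF4[i] * absmax for i in idxs]

vals = [0.5, -0.25, 0.9, -1.2, 0.01, 0.33, -0.7, 0.0]
idxs, absmax = quantize_block(vals)
recon = dequantize_block(idxs, absmax)
max_err = max(abs(a - b) for a, b in zip(vals, recon))
assert max_err < 0.15 * absmax  # half the widest NF4 gap, scaled by absmax
```

A real test would additionally parametrize over devices and compare NPU output against the CPU reference implementation element-wise.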

Security assessment:

  • No network access, no command execution, no serialization exploits, no obfuscation patterns in the Python code.
  • The native C++ code (npu_ops.cpp, npu_kernels.cpp) uses AscendC SDK APIs (ACLRT_LAUNCH_KERNEL, kernel_operator.h). These are vendor-provided APIs analogous to CUDA runtime APIs. The code does not execute arbitrary commands or access the filesystem.
  • The CMake changes execute npu-smi during build configuration (analogous to rocminfo for ROCm). This is a build-time tool invocation, consistent with existing patterns.
  • pythonInterface.cpp changes are properly guarded behind #if BUILD_NPU preprocessor conditionals.
  • The import importlib addition is benign (already used elsewhere in the codebase).
  • No security concerns identified.

Downstream Impact

Risk level: LOW (but see blocking issue #1)

The PR adds a new backend without changing existing APIs. The matmul_4bit change in _functions.py adds and A.device.type != "npu" to an existing condition, which is additive and follows the same pattern used for HPU.

However, blocking issue #1 (NPU taking priority over CUDA when torch_npu is installed) could silently break CUDA users who happen to have torch_npu installed. This would be a regression for existing users.

  • Transformers: Not affected (additive backend)
  • PEFT: Not affected
  • Accelerate: Not affected
  • TGI: Not affected
  • vLLM: Not affected

Cross-PR Conflicts

  • PR #1853 (Add MPS backend): overlaps on __init__.py, README.md — both add new backend imports and README table entries. File-level conflicts likely.
  • PR #1826 (Metal backend with libtorch): overlaps on cextension.py, __init__.py, CMakeLists.txt — similar new-backend additions.
  • PR #1823 (Add mps backend python only): overlaps on cextension.py, __init__.py.
  • PR #1858 (Add k-bit blockwise quantization): overlaps on CMakeLists.txt, csrc/pythonInterface.cpp.

None of these are semantic conflicts — they are independent backend/feature additions that will need mechanical merge resolution.

Verdict: Request changes. The CUDA-priority regression (issue #1), broken if/elif chain (issue #2), memory leak (issue #3), and hardcoded core count (issue #4) must be fixed before this can merge. The missing tests (additional issue #5) are also a blocking concern for a new backend.

  • Security: Clear
  • Downstream impact: LOW (additive backend), but issue #1 risks CUDA regression
  • Tests: Missing — no tests included for the new backend
  • CI: Not triggered (fork PR, needs maintainer approval)
  • Serialization: NPU packing format differs from CUDA (cross-device checkpoint incompatibility)
  • Cross-PR conflicts: File-level conflicts with 4 open PRs (mechanical resolution)

if torch.version.hip:
HIP_ENVIRONMENT = True
BNB_BACKEND = "ROCm"
elif is_npu_available():
Collaborator

Blocking: NPU detection takes priority over CUDA. is_npu_available() only checks if the torch_npu package is installed (via importlib.util.find_spec), not whether an actual NPU device is present. Since this elif comes before torch.cuda.is_available(), any system with torch_npu installed alongside CUDA will incorrectly set BNB_BACKEND = "NPU". This would silently break all CUDA users who happen to have torch_npu in their environment.

The fix: check for actual device availability (e.g., torch.npu.is_available() after importing torch_npu), and place the NPU check after CUDA in the priority chain, following the pattern of XPU/HPU.

binary_path = cuda_binary_path

elif is_npu_available():
binary_path = PACKAGE_DIR / f"libbitsandbytes_npu{DYNAMIC_LIBRARY_SUFFIX}"
Collaborator

Blocking: Broken if/elif chain. The original code had two independent if blocks here (one for cuda_specs, one for _has_xpu). By inserting elif is_npu_available() between them, the XPU check is now disconnected from the chain. The result is that on a system with both torch_npu and XPU, the NPU binary path is set, then immediately overwritten by the XPU path in the next if. This needs proper if/elif/elif/elif chaining across all backends.

BlockwiseNf4TilingData *tilingHost;
tilingHost = (struct BlockwiseNf4TilingData *)malloc(tilingSize);
uint32_t error = get_dequantize_blockwise_nf4_tiling(blocksize, n, tilingHost);
if (error != 0) {
Collaborator

Blocking: Memory leak. tilingHost is allocated with malloc() but never freed. This leaks memory on every dequantization call (every Linear4bit.forward()). Add free(tilingHost) after the kernel launch, or use stack allocation since BlockwiseNf4TilingData is a small struct.

tiling->ubSize = 196 * 1024;
uint32_t coreNum = 40;
uint32_t totalPkgNum = (n + blocksize - 1) / blocksize;
uint32_t singleCorePkgNum = (totalPkgNum + coreNum - 1) / coreNum;
Collaborator

Blocking: Hardcoded core count. coreNum = 40 is specific to one Ascend chip variant. Different Ascend models have different AI core counts (e.g., Ascend 310 has 2 cores, 910A has 32, 910B has 40). This should be queried from the hardware at runtime or passed as a parameter.
